Lightweight herd cloning during shim DMA BD loop unrolling by erwei-xilinx · Pull Request #1535 · Xilinx/mlir-air

erwei-xilinx · 2026-04-12T04:59:21Z

Summary

Avoid O(N × body_size) IR explosion during shim-level loop unrolling in air-opt-shim-dma-bds by skipping deep-clone of herd compute bodies.

When loopUnrollFullWithAsyncTokenPreserved detects a shim-level loop (containing air.SegmentOp or air.HerdOp), it uses a custom manual unroller that:

Fully clones segment bodies — L3 channel ops needed by BD folding are preserved
Creates lightweight herd copies — only channel ops, allocs, deallocs, wait_alls, their transitive operand-defining ops, and the terminator are cloned; heavy compute ops (vector, arith, linalg) are skipped entirely
Uses OperationState to create empty herd shells, avoiding custom builder API issues

Profiling (flash attention 12×4 launch, tiles=2,2)

Metric	Main	This PR
`air-opt-shim-dma-bds`	~50 ms	~43 ms

Current limitation

The lightweight unroller only fires when annotateFn is null (the non-tiled unroll path in loopUnrollFullWithAsyncTokenPreserved). The tiled path, which uses loopUnrollByFactor with an annotateFn callback, still performs full deep-clone unrolling. Extending lightweight cloning to the tiled path is left as a follow-up.

Test plan

ninja check-air-mlir passes (365/374, only pre-existing failures)
New LIGHTWEIGHT FileCheck test in opt_shim_dma_bds.mlir verifies channel ops preserved, compute ops stripped
Flash attention compiles end-to-end on NPU2
Profiling confirms no regression vs main

Copilot

Pull request overview

This PR optimizes the air-opt-shim-dma-bds pass to avoid IR blow-up by skipping runtime-loop tiling and full unrolling when shim-dma-tile-sizes is empty or all-1 (the default path), while preserving scf.for loops for later/lower-cost unrolling in downstream passes.

Changes:

Add a fast path in AIROptimizeShimDMABDs to skip tiling/unrolling when tile sizes are empty or all-1, while still performing L3 DMA folding.
Add a new MLIR regression test covering all-1, non-trivial tiling (2), and empty-tile-size behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`mlir/lib/Transform/AIRDependencyScheduleOpt.cpp`	Adds fast-path logic to skip tiling/unrolling for empty/all-1 tile sizes and attempts to preserve downstream barrier behavior.
`mlir/test/Transform/AIRDependencyScheduleOpt/opt_shim_dma_bds_skip_unroll.mlir`	New test validating skip-unroll behavior and non-trivial tiling behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

erwei-xilinx · 2026-04-13T17:14:52Z

Closing this PR. Hardware E2E testing on local NPU1 confirmed the change produces incorrect DMA BD configurations.

Root cause: Skipping BD folding (applyAIRL3DmaFoldingPatterns) is not safe — the BD folding performs essential loop-to-BD-dimension conversion that changes the channel ops' wraps/strides. Without it, the runtime loop gets unrolled in airrt-to-npu with per-iteration BDs that don't match the hardware's expected multi-dimensional BD format. This produces wrong air.insts.bin (18 KB vs 6.5 KB reference, different offset patterns) and wrong numerical output on NPU1.

What doesn't work: Simply skipping the tiling/unrolling/BD-folding for all-1 tile sizes. The inner BD folding patterns (AIRSpecializeChannelWrapAndStrideInScfFor + AIRUnrollScfForIntoBDChain) are coupled — the isolation pattern splits loops so each channel op gets its own loop, the specialization pattern folds loops into BD dimensions, and the unroll pattern handles cases that can't be folded. Skipping all of them produces fundamentally different (and incorrect) BD configurations.

What would work: A more surgical approach that runs BD folding on the channel ops but prevents AIRUnrollScfForIntoBDChain from unrolling the specific outer runtime loops (from launch conversion). This requires modifying the BD folding patterns to distinguish between "BD loops" (inner loops that should be folded into dimensions) and "runtime loops" (outer loops that should be preserved). This is a larger change that needs careful design.

The compilation speed optimization plan at workspace/compilation_speed_optimization_plan.md remains valid as a design document for the correct approach.

erwei-xilinx · 2026-04-13T18:19:45Z

Closing this PR. The compilation speed optimization (5x on MLIR passes for flash attention) is real and validated by profiling, but all implementation approaches explored so far hit correctness issues:

Skip-unroll approach (defer loop unrolling to airrt-to-npu): Fails because loopUnrollFull in airrt-to-npu doesn't handle airrt::EventType loop-carried values, and purgeSCFParContainingOnlyWaitAllOps unconditionally erases all scf.parallel ops including ones containing DMA ops.
Attribute-based approach (mark runtime loops, skip them in AIRUnrollScfForIntoBDChain): Produces different (incorrect) NPU instruction binaries because BD folding is effectively a no-op on multi-channel-op loop bodies, so the channel ops retain symbolic offsets that downstream passes can't resolve.
Strip segment/herd bodies before unrolling (Strategy A): Fails because air-to-std runs after the shim BD pass on the same module and needs segment bodies to match channel put/get pairs.

The correct approach requires either:

Lightweight cloning in loopUnrollFullWithAsyncTokenPreserved that skips segment/herd body contents during the clone step (only clones the shell + L3 channel ops)
Pipeline restructuring to run air-to-std channel matching before the shim BD pass's unrolling

The profiling analysis and detailed plan remain at workspace/compilation_speed_optimization_plan.md.

erwei-xilinx · 2026-04-27T17:32:16Z

Known limitation: tiled path not covered

The lightweight unroller currently only fires when annotateFn is null — i.e., the non-tiled unroll path inside loopUnrollFullWithAsyncTokenPreserved. The tiled path (triggered by non-trivial shim-dma-tile-sizes like 2,2 or 4,4) goes through loopUnrollByFactor with an annotateFn callback, which still performs full O(N × body_size) deep-clone unrolling.

For the default aircc invocation (tile sizes 1,1), the shim BD pass doesn't tile or unroll — it directly applies BD folding. The lightweight unroller fires when the launch conversion creates scf.for loops that contain segments/herds and need to be unrolled without tiling.

Extending lightweight cloning to the tiled loopUnrollByFactor path is left as a follow-up.

For shim-level scf.for loops that contain air.SegmentOp / air.HerdOp, replace the standard loopUnrollFull (which deep-clones the entire body N times) with a manual unroller that uses lightweight herd cloning. The lightweight herd clone (cloneHerdOpLightweight) creates a new herd shell via OperationState and populates the body with ONLY channel ops, allocs, deallocations, wait_alls, and their transitive operand-defining ops. Heavy compute ops (matrix multiply, vector ops, etc.) are never cloned. The segment clone (cloneSegmentOpLightweight) preserves the full segment body (L3 channel ops needed by BD folding) but applies lightweight cloning recursively to any contained air.HerdOp. This brings flash attention 12x4 (tiles=2,2) from crashing (T002) or taking O(N*body_size) time back to ~50ms, matching the main-branch baseline. Also adds a lit test (func_lightweight_unroll) that verifies channel ops are preserved and compute ops are excluded from the lightweight herd copies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rminating walk Two efficiency improvements identified in self-review: 1. collectHerdBodyOpsToKeep: replace fixed-point iteration (copies toKeep set each round) with a worklist algorithm. Each newly-added op is pushed once and processed once, avoiding redundant re-scanning of already-kept ops in subsequent rounds. 2. loopUnrollFullWithAsyncTokenPreserved: replace two sequential walks (one for SegmentOp, one for HerdOp) with a single interruptible walk that stops as soon as the first SegmentOp or HerdOp is found. Avoids walking the entire IR a second time in the common case where a segment is present. No functional change; build and all tests pass (same 2 pre-existing ROCDL failures unrelated to T003). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 12, 2026 04:59

erwei-xilinx requested a review from fifield as a code owner April 12, 2026 04:59

Copilot started reviewing on behalf of erwei-xilinx April 12, 2026 05:01 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

Comment thread mlir/lib/Transform/AIRDependencyScheduleOpt.cpp Outdated

Comment thread mlir/test/Transform/AIRDependencyScheduleOpt/opt_shim_dma_bds_skip_unroll.mlir Outdated

erwei-xilinx force-pushed the erwei/skip-shim-dma-unroll-for-tile1 branch 3 times, most recently from 9fe9375 to 14eb133 Compare April 13, 2026 17:10

erwei-xilinx closed this Apr 13, 2026

erwei-xilinx reopened this Apr 13, 2026

erwei-xilinx force-pushed the erwei/skip-shim-dma-unroll-for-tile1 branch 2 times, most recently from f33ea01 to f5f26c4 Compare April 13, 2026 17:43

erwei-xilinx marked this pull request as draft April 13, 2026 17:52

erwei-xilinx closed this Apr 13, 2026

erwei-xilinx reopened this Apr 13, 2026

erwei-xilinx force-pushed the erwei/skip-shim-dma-unroll-for-tile1 branch from f5f26c4 to bd0d7ed Compare April 13, 2026 19:58

erwei-xilinx marked this pull request as ready for review April 13, 2026 19:58

erwei-xilinx force-pushed the erwei/skip-shim-dma-unroll-for-tile1 branch from bd0d7ed to f56b871 Compare April 13, 2026 21:57

erwei-xilinx force-pushed the erwei/skip-shim-dma-unroll-for-tile1 branch 3 times, most recently from 9ac2f96 to 400cd2a Compare April 27, 2026 17:31

erwei-xilinx changed the title ~~Skip runtime loop unrolling in air-opt-shim-dma-bds for all-1 tile sizes~~ Lightweight herd cloning during shim DMA BD loop unrolling Apr 27, 2026

erwei-xilinx and others added 2 commits April 27, 2026 10:32

erwei-xilinx mentioned this pull request Apr 27, 2026

Extend lightweight herd cloning to tiled unroll path in air-opt-shim-dma-bds #1558

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lightweight herd cloning during shim DMA BD loop unrolling#1535

Lightweight herd cloning during shim DMA BD loop unrolling#1535
erwei-xilinx wants to merge 2 commits into
Xilinx:mainfrom
erwei-xilinx:erwei/skip-shim-dma-unroll-for-tile1

erwei-xilinx commented Apr 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

erwei-xilinx commented Apr 13, 2026

Uh oh!

erwei-xilinx commented Apr 13, 2026

Uh oh!

erwei-xilinx commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erwei-xilinx commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Profiling (flash attention 12×4 launch, tiles=2,2)

Current limitation

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

erwei-xilinx commented Apr 13, 2026

Uh oh!

erwei-xilinx commented Apr 13, 2026

Uh oh!

erwei-xilinx commented Apr 27, 2026

Known limitation: tiled path not covered

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erwei-xilinx commented Apr 12, 2026 •

edited

Loading